NRC VAD Lexicon v2: Norms for Valence, Arousal, and Dominance for over 55k English Terms
Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D) (also referred to in social cognition research as Competence (C)). These dimensions impact various aspects of our lives, from social competence and emotion regulation to success in the workplace and how we view the world. We present here the NRC VAD Lexicon v2, which has human ratings of valence, arousal, and dominance for more than 55,000 English words and phrases. Notably, it adds entries for ~25k additional words to v1.0. It also now includes, for the first time, entries for common multi-word phrases (~10k). We show that the associations are highly reliable. The lexicon enables a wide variety of research in psychology, NLP, public health, digital humanities, and social sciences. The NRC VAD Lexicon v2 is made freely available for research through our project webpage.
Statistical Mechanics of Semantic Compression
The basic problem of semantic compression is to minimize the length of a message while preserving its meaning. This differs from classical notions of compression in that the distortion is not measured directly at the level of bits, but rather in an abstract semantic space. In order to make this precise, we take inspiration from cognitive neuroscience and machine learning and model semantic space as a continuous Euclidean vector space. In such a space, stimuli like speech, images, or even ideas, are mapped to high-dimensional real vectors, and the location of these embeddings determines their meaning relative to other embeddings. This suggests that a natural metric for semantic similarity is just the Euclidean distance, which is what we use in this work. We map the optimization problem of determining the minimal-length, meaning-preserving message to a spin glass Hamiltonian and solve the resulting statistical mechanics problem using replica theory. We map out the replica symmetric phase diagram, identifying distinct phases of semantic compression: a first-order transition occurs between lossy and lossless compression, whereas a continuous crossover is seen from extractive to abstractive compression. We conclude by showing numerical simulations of compressions obtained by simulated annealing and greedy algorithms, and argue that while the problem of finding a meaning-preserving compression is computationally hard in the worst case, there exist efficient algorithms which achieve near optimal performance in the typical case.
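As a rough illustration of the greedy approach mentioned above, the following Python sketch selects a fixed number of items whose combined embedding stays close to that of the full message under Euclidean distance. It is not the authors' formulation: representing a message's meaning as the mean of its item embeddings, the fixed `budget`, and the function names are all assumptions made for this example.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_compress(embeddings, budget):
    """Greedily pick `budget` item indices (hypothetical helper).

    Assumption: the meaning of a message is the mean of its item
    embeddings, and distortion is the Euclidean distance between the
    mean of the kept items and the mean of the full message.
    """
    target = mean_vector(embeddings)
    chosen = []
    remaining = list(range(len(embeddings)))
    for _ in range(budget):
        best_idx, best_dist = None, float("inf")
        for i in remaining:
            candidate = [embeddings[j] for j in chosen + [i]]
            dist = euclidean(mean_vector(candidate), target)
            if dist < best_dist:
                best_idx, best_dist = i, dist
        chosen.append(best_idx)
        remaining.remove(best_idx)
    kept = [embeddings[j] for j in chosen]
    return sorted(chosen), euclidean(mean_vector(kept), target)


# Toy 2-D embeddings standing in for the items of one message.
message = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.1], [1.0, 4.0], [1.0, -4.0]]
kept_indices, distortion = greedy_compress(message, budget=2)
print(kept_indices, distortion)
```

Keeping every item drives the distortion to zero, which corresponds to the lossless end of the trade-off; smaller budgets give shorter messages at the cost of some distortion.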
RideKE: Leveraging Low-Resource, User-Generated Twitter Content for Sentiment and Emotion Detection in Kenyan Code-Switched Dataset
Etori, Naome A., Gini, Maria L.
Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and the major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult as Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms other models. For sentiment analysis, the supervised XLM-R model achieves the highest accuracy (69.2%) and F1 score (66.1%), followed by semi-supervised XLM-R (67.2% accuracy, 64.1% F1 score). In emotion analysis, supervised DistilBERT leads in accuracy (59.8%) and F1 score (31%), followed by semi-supervised mBERT (59% accuracy, 26.5% F1 score). AfriBERTa models show the lowest accuracy and F1 scores. All models tend to predict neutral sentiment, with AfriBERTa showing the highest bias and unique sensitivity to empathy emotion. https://github.com/NEtori21/Ride_hailing
Machines of Meaning
One goal of Artificial Intelligence is to learn meaningful representations for natural language expressions, but what this entails is not always clear. A variety of new linguistic behaviours present themselves embodied as computers, enhanced humans, and collectives with various kinds of integration and communication. But to measure and understand the behaviours generated by such systems, we must clarify the language we use to talk about them. Computational models are often confused with the phenomena they try to model and shallow metaphors are used as justifications for (or to hype) the success of computational techniques on many tasks related to natural language; thus implying their progress toward human-level machine intelligence without ever clarifying what that means. This paper discusses the challenges in the specification of "machines of meaning", machines capable of acquiring meaningful semantics from natural language in order to achieve their goals. We characterize "meaning" in a computational setting, while highlighting the need for detachment from anthropocentrism in the study of the behaviour of machines of meaning. The pressing need to analyse AI risks and ethics requires a proper measurement of its capabilities which cannot be productively studied and explained while using ambiguous language. We propose a view of "meaning" to facilitate the discourse around approaches such as neural language models and help broaden the research perspectives for technology that facilitates dialogues between humans and machines.
What fifty-one years of Linguistics and Artificial Intelligence research tell us about their correlation: A scientometric review
There is a strong correlation between linguistics and artificial intelligence (AI), best manifested by deep learning language models. This study provides a thorough scientometric analysis of this correlation, synthesizing the intellectual production over 51 years, from 1974 to 2024. It involves 5750 Web of Science-indexed articles published in 2124 journals, which are written by 20835 authors belonging to 13773 research centers in 794 countries. Two powerful software tools, viz., CiteSpace and VOSviewer, were used to generate mapping visualizations of the intellectual landscape, trending issues and (re)emerging hotspots. The results indicate that in the 1980s and 1990s, linguistics and AI research was not robust, characterized by unstable publication over time. It has, however, witnessed a remarkable increase in publication since then, reaching 1478 articles in 2023 and 546 articles in the January-March timespan of 2024, involving emerging issues and hotspots, addressing new horizons and new topics, and launching new applications and powerful deep learning language models including ChatGPT.
WorryWords: Norms of Anxiety Association for over 44k English Words
Anxiety, the anticipatory unease about a potential negative outcome, is a common and beneficial human emotion. However, there is still much that is not known, such as how anxiety relates to our body and how it manifests in language. This is especially pertinent given the increasing impact of anxiety-related disorders. In this work, we introduce WorryWords, the first large-scale repository of manually derived word--anxiety associations for over 44,450 English words. We show that the anxiety associations are highly reliable. We use WorryWords to study the relationship between anxiety and other emotion constructs, as well as the rate at which children acquire anxiety words with age. Finally, we show that using WorryWords alone, one can accurately track the change of anxiety in streams of text. The lexicon enables a wide variety of anxiety-related research in psychology, NLP, public health, and social sciences. WorryWords (and its translations to over 100 languages) is freely available. http://saifmohammad.com/worrywords.html
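To make the idea of tracking anxiety in a stream of text with a word-association lexicon concrete, here is a minimal Python sketch. It does not use the actual WorryWords data or its file format: the `TOY_LEXICON` entries, their scores and score range, the tokenizer, and the per-text averaging are all assumptions chosen for illustration.

```python
import re

# Hypothetical word -> anxiety-association scores (NOT real WorryWords values).
TOY_LEXICON = {
    "panic": 0.9,
    "worry": 0.7,
    "deadline": 0.5,
    "calm": -0.8,
    "relaxed": -0.7,
}


def anxiety_score(text, lexicon):
    """Mean lexicon score over the words of `text` found in the lexicon.

    Returns None when no word of the text is covered by the lexicon.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else None


def track_anxiety(stream, lexicon):
    """Score each text of a chronologically ordered stream in turn."""
    return [anxiety_score(text, lexicon) for text in stream]


# A toy stream of posts ordered in time.
posts = [
    "Feeling calm and relaxed today",
    "The deadline is close and I worry",
    "Total panic about the deadline",
]
print(track_anxiety(posts, TOY_LEXICON))
```

Averaging only over covered words keeps the score independent of text length. In practice one would likely also smooth the per-text scores over a time window before reading off a trend.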
Investigating Idiomaticity in Word Representations
He, Wei, Vieira, Tiago Kramer, Garcia, Marcos, Scarton, Carolina, Idiart, Marco, Villavicencio, Aline
Idiomatic expressions are an integral part of human languages, often used to express complex ideas in compressed or conventional ways (e.g. eager beaver as a keen and enthusiastic person). However, their interpretations may not be straightforwardly linked to the meanings of their individual components in isolation and this may have an impact for compositional approaches. In this paper, we investigate to what extent word representation models are able to go beyond compositional word combinations and capture multiword expression idiomaticity and some of the expected properties related to idiomatic meanings. We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese), presenting a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels, their paraphrases and their occurrences in naturalistic and sense-neutral contexts, totalling 32,200 sentences. We propose this set of minimal pairs for evaluating how well a model captures idiomatic meanings, and define a set of fine-grained metrics of Affinity and Scaled Similarity, to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity. The results obtained with a variety of representative and widely used models indicate that, despite superficial indications to the contrary in the form of high similarities, idiomaticity is not yet accurately represented in current models. Moreover, the performance of models with different levels of contextualisation suggests that their ability to capture context is not yet able to go beyond more superficial lexical clues provided by the words and to actually incorporate the relevant semantic clues needed for idiomaticity.
Generative linguistics contribution to artificial intelligence: Where this contribution lies?
This article aims to characterize the contribution of Generative linguistics (GL) to artificial intelligence (AI), alluding to the debate among linguists and AI scientists on whether linguistics belongs to the humanities or to science. In this article, I will try not to be biased as a linguist, studying the phenomenon from an independent scientific perspective. The article walks the researcher/reader through the scientific theorems and rationales involved in AI which derive from GL, specifically the Chomsky School. It thus provides good evidence from syntax, semantics, language faculty, Universal Grammar, computational system of human language, language acquisition, human brain, programming languages (e.g. Python), Large Language Models, and unbiased AI scientists that this contribution is huge and cannot be denied. It concludes that, despite the huge GL contribution to AI, there are still points of divergence, including the nature and type of language input.
The fusion of phonography and ideographic characters into virtual Chinese characters -- Based on Chinese and English
The characters used in modern countries are mainly divided into ideographic characters and phonetic characters, both of which have their advantages and disadvantages. Chinese is difficult to learn and easy to master, while English is easy to learn but has a large vocabulary. There is still no language that combines the advantages of both, places a lower demand on memory, can form words, and is easy to learn. Therefore, inventing new characters that can be combined into words would support the popularization of deep knowledge and reduce disputes through communication. First, we observe the advantages and disadvantages of Chinese and English, such as their vocabulary, information content, and ease of learning in deep scientific knowledge, and create a new writing system. Then, we use comparative analysis to observe the total score of the new language. Through this article, it can be concluded that the new text combines the advantages of both pictographic and alphabetical writing: new characters that can be combined into words reduce the vocabulary that needs to be learned; special prefixes allow beginners to quickly guess the approximate category and meaning of unseen words; and new characters can enable humans to quickly learn more advanced knowledge.